Ordinal regression in brms

Let’s load some data from Experiment 2 of Truncating the Y-Axis: Threat or Menace? The experiment tests whether visualization designs that signal a truncated axis can counteract distortions in perceived differences between bars in a bar chart. Participants judge the magnitude of differences (i.e., effect size) on a 5-point Likert-style scale.

The data is at https://osf.io/5pjgb/, and the paper is at https://arxiv.org/pdf/1907.02035.pdf.

data <- read_csv("cleanrows2.csv")
## Parsed with column specification:
## cols(
##   visType = col_character(),
##   framing = col_character(),
##   dataSize = col_double(),
##   truncationLevel = col_double(),
##   trendDirection = col_double(),
##   slope = col_double(),
##   id = col_character(),
##   index = col_double(),
##   training = col_double(),
##   rt = col_double(),
##   timestamp = col_character(),
##   qTrend = col_double(),
##   qSeverity = col_double(),
##   correct = col_double(),
##   firstX = col_double(),
##   lastX = col_double(),
##   x0 = col_double(),
##   x1 = col_double(),
##   x2 = col_double()
## )

Let’s preprocess our data for the purpose of modeling. This mostly involves renaming some variables and/or casting some of them as factors. We also want to drop training trials from our data as Correll et al. did in the paper.

data <- data %>%
  mutate(
    vis = as.factor(visType),
    # truncation = as.factor(truncationLevel),
    # slope = as.factor(slope),
    complexity = as.factor(dataSize),
    subject = as.factor(id)) %>%
  rename(
    response = qSeverity,
    truncation = truncationLevel
  ) %>%
  filter(is.na(training))

We’re going to fit an ordinal regression model using the brms package. You can read about this kind of model and brms at https://osf.io/gyfj7/download. Basically, the model assumes that the ordinal response on a K-point scale is generated by partitioning a latent (unobserved) continuous variable using K - 1 thresholds. The latent scale represents the participant’s mental representation of perceived effect size in arbitrary units. The thresholds separating adjacent response categories are intercepts in our model, and their location on the latent scale is modified by each fixed and random effect parameter in our model.
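To make this data-generating assumption concrete, here is a toy simulation of a cumulative probit process (the thresholds and latent-scale values are invented for illustration, not taken from the fitted model):

```r
set.seed(1)
thresholds <- c(-1.5, -0.5, 0.5, 1.5)    # K - 1 = 4 thresholds for a 5-point scale
latent <- rnorm(10, mean = 0.8, sd = 1)  # latent perceived effect size, arbitrary units
# each response is 1 plus the number of thresholds the latent value exceeds
response <- 1 + rowSums(outer(latent, thresholds, `>`))
response
```

Shifting the latent mean (what the fixed effects do) pushes responses toward one end of the scale; changing the latent spread (which brms parameterizes inversely through disc) changes how many categories the responses land in.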

The model specification we’ve chosen will enable us to make inferences about the impact of different visualization designs and different levels of axis truncation, controlling for the actual magnitude of the difference participants are judging, the number of bars shown, and individual variability in responses. Note that in addition to modeling the threshold locations on the latent scale, we are also modeling the variability of the latent scale (i.e., ability to discriminate the underlying signal), allowing it to vary per participant.

We’ll fit the model using the default priors in brms.

m1 <- brm(
  formula = bf(
    response ~ vis + truncation*slope + complexity + (1|subject),
    disc ~ (1|subject)),
  family = cumulative("probit"),
  chains = 2,
  cores = 2,
  iter = 2500,
  warmup = 1000,
  data = data,
  control = list(adapt_delta = 0.99),
  file = "ordinal-model1")

Let’s check some diagnostics.

In this summary table, we want to see Rhat values very close to 1.0 and Bulk_ESS in the thousands for proper inference. Rhat is a measure of chain mixing. We want to know that our chains are not getting stuck in local maxima of the log joint posterior density (i.e., the quantity the sampling algorithm explores). When chains get stuck at local maxima, it is usually a sign that our parameter space has a “lumpy” geometry, and we should probably reparameterize our model.

Bulk_ESS is effective sample size. This tells us how many functionally independent samples from the posterior we actually have. For joint posteriors with strong correlation between parameters, this can be quite a bit lower than the number of MCMC iterations after warmup. We can bump Bulk_ESS higher by running our model fitting process for more iterations.
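If you prefer to check these diagnostics programmatically rather than scanning the summary table, brms re-exports rhat() and neff_ratio() from bayesplot (this sketch assumes the m1 fit from above):

```r
# worst-case diagnostics across all parameters in the model
max(rhat(m1), na.rm = TRUE)        # want this very close to 1.00
min(neff_ratio(m1), na.rm = TRUE)  # effective samples as a fraction of total draws
```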

summary(m1)
##  Family: cumulative 
##   Links: mu = probit; disc = log 
## Formula: response ~ vis + truncation * slope + complexity + (1 | subject) 
##          disc ~ (1 | subject)
##    Data: data (Number of observations: 1152) 
## Samples: 2 chains, each with iter = 2500; warmup = 1000; thin = 1;
##          total post-warmup samples = 3000
## 
## Group-Level Effects: 
## ~subject (Number of levels: 32) 
##                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)         10.08      4.10     4.30    19.63 1.01      345      454
## sd(disc_Intercept)     0.28      0.05     0.18     0.39 1.00      885     1416
## 
## Population-Level Effects: 
##                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1]        24.02      9.54    10.39    46.86 1.01      318      426
## Intercept[2]        44.54     17.33    19.97    86.92 1.01      313      428
## Intercept[3]        63.12     24.52    28.59   123.69 1.01      313      406
## Intercept[4]        81.20     31.58    36.51   160.67 1.01      313      401
## disc_Intercept      -2.40      0.38    -3.15    -1.66 1.01      314      437
## visbottomgradbar    -0.77      1.08    -3.20     1.08 1.00     1636     1049
## visbrokenbar         0.22      1.04    -1.89     2.28 1.00     1717     1211
## truncation          35.91     15.66    14.37    73.57 1.01      367      453
## slope              169.82     67.01    75.76   329.80 1.01      323      398
## complexity3         -3.13      1.52    -6.98    -1.04 1.00      467      514
## truncation:slope    18.38     35.98   -45.74    99.04 1.00     1645     1227
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Trace plots help us look at chain mixing in greater detail. This can be helpful for diagnosing exactly which parameters the sampler is getting stuck on.

plot(m1)

Pairs plots show us correlation between pairs of parameters. Perfectly correlated parameters are usually bad news. This means that our model cannot differentiate the effect of two redundant parameters, often because of issues of identifiability. This perfect correlation between parameters is called multicollinearity, and it usually means we need to reparameterize our model.
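As a toy illustration of why perfect collinearity breaks identifiability (this example uses simulated data, not the experiment’s), lm() simply gives up on one of two redundant predictors:

```r
set.seed(2)
x1 <- rnorm(100)
x2 <- 2 * x1           # x2 is a deterministic function of x1
y  <- x1 + rnorm(100)
coef(lm(y ~ x1 + x2))  # the coefficient for x2 comes back NA
```

An MCMC sampler doesn’t get to drop a parameter the way lm() does, so in a Bayesian model the same redundancy shows up as a ridge in the posterior that the chains wander along.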

I usually separate these out into coherent sets of parameters since these plots don’t scale well.

pairs(m1, pars = c("b_Intercept[1]","b_Intercept[2]","b_Intercept[3]", "b_Intercept[4]", "b_disc_Intercept"), exact_match = TRUE)

Some of these intercepts look pretty strongly correlated, but it doesn’t seem to have caused any other issues with the model fit.

pairs(m1, pars = c("sd_subject__Intercept", "sd_subject__disc_Intercept"), exact_match = TRUE)

pairs(m1, pars = c("b_truncation", "b_slope"))

pairs(m1, pars = c("b_vis", "b_complexity"))

Let’s compare the empirical distribution of responses to the distribution predicted by the model.

Here’s the empirical distribution.

data %>%
  mutate(rating = ordered(response, levels = c("5","4","3","2","1"))) %>%
  ggplot(aes(x = vis, fill = rating)) +
    geom_bar() +
    theme_minimal() +
    facet_grid(. ~ truncation)

Here’s the predicted distribution. Note that the model produces many draws (i.e., samples) constituting a predictive distribution for each observation in the data set. In order to make the counts match, we take the model’s median response on each trial.

data %>%
  add_predicted_draws(m1, seed = 14) %>%
  group_by(vis, truncation, index, subject) %>%
  summarize(rating = ordered(as.factor(median(as.numeric(.prediction))), levels =c("5","4","3","2","1"))) %>%
  ggplot(aes(x = vis, fill = rating)) +
    geom_bar() +
    theme_minimal() +
    facet_grid(. ~ truncation)

Notice how the average model prediction for each trial errs on the side of underestimating responses. This is just an artifact of taking the median of the distribution of predictions for each trial.

Now, instead of taking a median, let’s plot the predictive uncertainty of the model as animated hypothetical outcome plots (HOPs).

spec <- data %>%
  add_predicted_draws(m1, seed = 14, n = 50) %>%
  mutate(rating = ordered(.prediction, levels = c("5","4","3","2","1"))) %>%
  ggplot(aes(x = vis, fill = rating)) +
    geom_bar() +
    theme_minimal() +
    facet_grid(. ~ truncation) +
    transition_manual(.draw)

animate(spec, fps = 2.5)
## nframes and fps adjusted to match transition

This shows us in detail how uncertain the model is about the distribution of responses in each condition. However, it does not give us a sense of inferential uncertainty. One common way that researchers tend to make inferences with Likert-style responses is to model differences in average response. We can do the same thing with the output of the ordinal regression model, and we can do so without assuming that these discrete responses are the outcome of an untransformed linear model (e.g., ANOVA).

Let’s find the average response in each condition, conditioning on the average user (i.e., no longer including random effects in our predictions). Note that now we are taking the average rating in each condition for each draw, giving us a distribution of inferential uncertainty about average responses (i.e., a sampling distribution of average response) rather than a wider distribution of predictive uncertainty about non-aggregated responses.

data %>%
  add_predicted_draws(m1, seed = 14, re_formula = NA) %>%
  group_by(vis, truncation, .draw) %>%
  summarize(rating = weighted.mean(as.numeric(.prediction))) %>%
  ggplot(aes(x = vis, y = rating)) +
    stat_eye() +
    theme_minimal() +
    coord_cartesian(ylim = c(1.0, 3.5)) +
    facet_grid(. ~ truncation)

Let’s look at contrasts for the effect of visualization design.

data %>%
  add_predicted_draws(m1, seed = 14, re_formula = NA) %>%
  group_by(vis, .draw) %>%
  summarize(rating = weighted.mean(as.numeric(.prediction))) %>%
  compare_levels(rating, by = vis) %>%
  rename(diff_in_rating = rating) %>%
  ggplot(aes(x = diff_in_rating, y = vis)) +
    stat_halfeyeh() +
    geom_vline(xintercept = 0, linetype = "longdash") +
    theme_minimal()

One nice thing about Bayesian analysis is that we can say that we do not find a reliable difference between visualization conditions without all of that “failing to reject the null” business.

And let’s also look at contrasts for the effect of axis truncation.

data %>%
  add_predicted_draws(m1, seed = 14, re_formula = NA) %>%
  group_by(truncation, .draw) %>%
  summarize(rating = weighted.mean(as.numeric(.prediction))) %>%
  compare_levels(rating, by = truncation) %>%
  rename(diff_in_rating = rating) %>%
  ggplot(aes(x = diff_in_rating, y = truncation)) +
    stat_halfeyeh() +
    geom_vline(xintercept = 0, linetype = "longdash") +
    theme_minimal()

These differences are really big, but we could also see this in a few of the previous charts. Axis limits have a pretty robust influence on perceived effect size.

Troubleshooting

Let’s take a model that doesn’t fit and try to diagnose what’s wrong. This model uses a more complicated interaction term than the last one by assuming that complexity (number of bars) interacts with visualization design, truncation, and slope (effect size).

m2 <- brm(
  formula = bf(
    response ~ vis*truncation*slope*complexity + (1|subject),
    disc ~ (1|subject)),
  family = cumulative("probit"),
  chains = 2,
  cores = 2,
  iter = 2500,
  warmup = 1000,
  data = data,
  control = list(adapt_delta = 0.99),
  file = "ordinal-model2")

We can see from the summary that the chains did not converge (Rhat > 1.1).

summary(m2)
## Warning: The model has not converged (some Rhats are > 1.1). Do not analyse the results! 
## We recommend running more iterations and/or setting stronger priors.
##  Family: cumulative 
##   Links: mu = probit; disc = log 
## Formula: response ~ vis * truncation * slope * complexity + (1 | subject) 
##          disc ~ (1 | subject)
##    Data: data (Number of observations: 1152) 
## Samples: 2 chains, each with iter = 2500; warmup = 1000; thin = 1;
##          total post-warmup samples = 3000
## 
## Group-Level Effects: 
## ~subject (Number of levels: 32) 
##                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)      10436.07   4848.06  2178.07 19675.23 1.32        5       18
## sd(disc_Intercept)     0.27      0.05     0.17     0.38 1.04       61      160
## 
## Population-Level Effects: 
##                                                 Estimate Est.Error   l-95% CI
## Intercept[1]                                    17729.00   8980.76    3429.33
## Intercept[2]                                    37478.61  16959.18    8557.42
## Intercept[3]                                    56016.83  24851.76   12923.72
## Intercept[4]                                    74057.16  32516.51   16917.31
## disc_Intercept                                     -9.23      0.55      -9.97
## visbottomgradbar                                -1955.18   5823.15  -12526.13
## visbrokenbar                                     2484.27   6806.93   -7457.49
## truncation                                      28691.35  17235.30    1474.18
## slope                                          151410.02  68959.10   34821.93
## complexity3                                     -1516.24   3384.42   -7848.49
## visbottomgradbar:truncation                     -4528.49  16117.07  -36943.20
## visbrokenbar:truncation                         -5783.75  20466.11  -66440.86
## visbottomgradbar:slope                          17010.84  29916.62  -31585.23
## visbrokenbar:slope                              -3802.45  31598.95  -62575.73
## truncation:slope                                83948.34  63021.18   -9934.63
## visbottomgradbar:complexity3                    -2997.09   7009.19  -15167.79
## visbrokenbar:complexity3                       -10619.04   9193.59  -28274.06
## truncation:complexity3                          -9791.59  11637.24  -31852.37
## slope:complexity3                               -3752.87  15518.61  -31680.93
## visbottomgradbar:truncation:slope              -38760.26  77256.09 -180295.92
## visbrokenbar:truncation:slope                   -2523.47  93499.58 -143779.07
## visbottomgradbar:truncation:complexity3         36772.22  23776.86   -5774.30
## visbrokenbar:truncation:complexity3             39968.33  32294.41   -7122.51
## visbottomgradbar:slope:complexity3              12987.26  35836.76  -59694.29
## visbrokenbar:slope:complexity3                  46447.63  44228.60  -33907.37
## truncation:slope:complexity3                    20059.28  58561.50  -88471.74
## visbottomgradbar:truncation:slope:complexity3 -140961.72 104355.29 -311619.14
## visbrokenbar:truncation:slope:complexity3     -171145.73 155491.21 -470387.86
##                                                u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1]                                   35475.81 1.28        6       28
## Intercept[2]                                   68955.98 1.34        5       17
## Intercept[3]                                  102658.57 1.36        5       15
## Intercept[4]                                  133472.53 1.37        5       15
## disc_Intercept                                    -7.80 1.42        4       15
## visbottomgradbar                                8622.71 2.25        3       28
## visbrokenbar                                   17329.30 2.04        3       11
## truncation                                     58849.34 1.30        5       19
## slope                                         282245.40 1.32        5       19
## complexity3                                     4787.86 1.26        7       53
## visbottomgradbar:truncation                    24055.48 1.69        3       17
## visbrokenbar:truncation                        22460.41 2.09        3       11
## visbottomgradbar:slope                         75425.39 1.97        3       24
## visbrokenbar:slope                             50125.42 2.11        3       34
## truncation:slope                              204104.03 1.78        3       46
## visbottomgradbar:complexity3                   11192.09 1.56        4       16
## visbrokenbar:complexity3                        4622.95 1.58        3       24
## truncation:complexity3                         11245.90 1.42        4       42
## slope:complexity3                              27543.11 1.33        5       80
## visbottomgradbar:truncation:slope             105196.23 1.70        3       25
## visbrokenbar:truncation:slope                 229429.24 2.09        3       12
## visbottomgradbar:truncation:complexity3        75577.66 1.47        4       39
## visbrokenbar:truncation:complexity3           105046.33 1.78        3       11
## visbottomgradbar:slope:complexity3             71449.65 1.51        4       33
## visbrokenbar:slope:complexity3                129702.12 1.65        3       25
## truncation:slope:complexity3                  144635.72 1.35        6       20
## visbottomgradbar:truncation:slope:complexity3  54997.40 1.51        4       25
## visbrokenbar:truncation:slope:complexity3      86490.74 1.77        3       13
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Let’s take a look at these divergent chains as trace plots.

plot(m2)

We can see that the chains are wandering around without converging. This makes me think that we could improve our model fit with some regularization (narrower priors).

Sometimes (not in this case) chains will get stuck sampling a single parameter value repeatedly. This usually signals issues with identifiability, suggesting that our model is mis-specified and should be reparameterized.

Let’s take a look at the default priors we’ve been using before we create our own.

get_prior(
  formula = bf(
    response ~ vis*truncation*slope*complexity + (1|subject),
    disc ~ (1|subject)),
  family = cumulative("probit"),
  data = data)
##                  prior     class                                          coef
## 1                              b                                              
## 2                              b                                   complexity3
## 3                              b                                         slope
## 4                              b                             slope:complexity3
## 5                              b                                    truncation
## 6                              b                        truncation:complexity3
## 7                              b                              truncation:slope
## 8                              b                  truncation:slope:complexity3
## 9                              b                              visbottomgradbar
## 10                             b                  visbottomgradbar:complexity3
## 11                             b                        visbottomgradbar:slope
## 12                             b            visbottomgradbar:slope:complexity3
## 13                             b                   visbottomgradbar:truncation
## 14                             b       visbottomgradbar:truncation:complexity3
## 15                             b             visbottomgradbar:truncation:slope
## 16                             b visbottomgradbar:truncation:slope:complexity3
## 17                             b                                  visbrokenbar
## 18                             b                      visbrokenbar:complexity3
## 19                             b                            visbrokenbar:slope
## 20                             b                visbrokenbar:slope:complexity3
## 21                             b                       visbrokenbar:truncation
## 22                             b           visbrokenbar:truncation:complexity3
## 23                             b                 visbrokenbar:truncation:slope
## 24                             b     visbrokenbar:truncation:slope:complexity3
## 25 student_t(3, 0, 10) Intercept                                              
## 26                     Intercept                                             1
## 27                     Intercept                                             2
## 28                     Intercept                                             3
## 29                     Intercept                                             4
## 30 student_t(3, 0, 10)        sd                                              
## 31                            sd                                              
## 32                            sd                                     Intercept
## 33        normal(0, 1) Intercept                                              
## 34 student_t(3, 0, 10)        sd                                              
## 35                            sd                                              
## 36                            sd                                     Intercept
##      group resp dpar nlpar bound
## 1                               
## 2                               
## 3                               
## 4                               
## 5                               
## 6                               
## 7                               
## 8                               
## 9                               
## 10                              
## 11                              
## 12                              
## 13                              
## 14                              
## 15                              
## 16                              
## 17                              
## 18                              
## 19                              
## 20                              
## 21                              
## 22                              
## 23                              
## 24                              
## 25                              
## 26                              
## 27                              
## 28                              
## 29                              
## 30                              
## 31 subject                      
## 32 subject                      
## 33              disc            
## 34              disc            
## 35 subject      disc            
## 36 subject      disc

We’ve got a bunch of slope parameters because of all of the interaction effects in our model. Let’s look at the first model (the one that fit) for comparison.

get_prior(
  formula = bf(
    response ~ vis*truncation*slope + complexity + (1|subject),
    disc ~ (1|subject)),
  family = cumulative("probit"),
  data = data)
##                  prior     class                              coef   group resp
## 1                              b                                               
## 2                              b                       complexity3             
## 3                              b                             slope             
## 4                              b                        truncation             
## 5                              b                  truncation:slope             
## 6                              b                  visbottomgradbar             
## 7                              b            visbottomgradbar:slope             
## 8                              b       visbottomgradbar:truncation             
## 9                              b visbottomgradbar:truncation:slope             
## 10                             b                      visbrokenbar             
## 11                             b                visbrokenbar:slope             
## 12                             b           visbrokenbar:truncation             
## 13                             b     visbrokenbar:truncation:slope             
## 14 student_t(3, 0, 10) Intercept                                               
## 15                     Intercept                                 1             
## 16                     Intercept                                 2             
## 17                     Intercept                                 3             
## 18                     Intercept                                 4             
## 19 student_t(3, 0, 10)        sd                                               
## 20                            sd                                   subject     
## 21                            sd                         Intercept subject     
## 22        normal(0, 1) Intercept                                               
## 23 student_t(3, 0, 10)        sd                                               
## 24                            sd                                   subject     
## 25                            sd                         Intercept subject     
##    dpar nlpar bound
## 1                  
## 2                  
## 3                  
## 4                  
## 5                  
## 6                  
## 7                  
## 8                  
## 9                  
## 10                 
## 11                 
## 12                 
## 13                 
## 14                 
## 15                 
## 16                 
## 17                 
## 18                 
## 19                 
## 20                 
## 21                 
## 22 disc            
## 23 disc            
## 24 disc            
## 25 disc

That’s a difference of 11 parameters from changing a + to a * in our model specification! This is one reason why it’s smart to build models up incrementally in order to understand more complex models in terms of simpler ones, a process called model expansion.

Let’s try placing some narrower priors on these slope parameters. This should make it easier for the sampler to home in on a specific region of parameter space.

m3 <- brm(
  formula = bf(
    response ~ vis*truncation*slope*complexity + (1|subject),
    disc ~ (1|subject)),
  family = cumulative("probit"),
  prior = prior(normal(0, 0.5), class = b),
  chains = 2,
  cores = 2,
  iter = 2500,
  warmup = 1000,
  data = data,
  control = list(adapt_delta = 0.99),
  file = "ordinal-model3")

Let’s check our model fit.

summary(m3)
##  Family: cumulative 
##   Links: mu = probit; disc = log 
## Formula: response ~ vis * truncation * slope * complexity + (1 | subject) 
##          disc ~ (1 | subject)
##    Data: data (Number of observations: 1152) 
## Samples: 2 chains, each with iter = 2500; warmup = 1000; thin = 1;
##          total post-warmup samples = 3000
## 
## Group-Level Effects: 
## ~subject (Number of levels: 32) 
##                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)          0.15      0.03     0.10     0.22 1.00      541      940
## sd(disc_Intercept)     0.27      0.05     0.18     0.38 1.00     1058     1956
## 
## Population-Level Effects: 
##                                               Estimate Est.Error l-95% CI
## Intercept[1]                                      0.31      0.06     0.20
## Intercept[2]                                      0.61      0.10     0.43
## Intercept[3]                                      0.89      0.14     0.63
## Intercept[4]                                      1.16      0.18     0.83
## disc_Intercept                                    1.74      0.17     1.42
## visbottomgradbar                                 -0.03      0.06    -0.15
## visbrokenbar                                     -0.02      0.06    -0.14
## truncation                                        0.54      0.12     0.31
## slope                                             2.17      0.34     1.53
## complexity3                                      -0.07      0.06    -0.19
## visbottomgradbar:truncation                      -0.11      0.12    -0.36
## visbrokenbar:truncation                          -0.08      0.12    -0.32
## visbottomgradbar:slope                            0.25      0.26    -0.25
## visbrokenbar:slope                                0.22      0.25    -0.25
## truncation:slope                                  0.60      0.37    -0.10
## visbottomgradbar:complexity3                      0.02      0.08    -0.12
## visbrokenbar:complexity3                         -0.03      0.07    -0.18
## truncation:complexity3                           -0.03      0.12    -0.27
## slope:complexity3                                 0.15      0.24    -0.30
## visbottomgradbar:truncation:slope                -0.22      0.41    -1.04
## visbrokenbar:truncation:slope                    -0.08      0.41    -0.88
## visbottomgradbar:truncation:complexity3           0.15      0.16    -0.16
## visbrokenbar:truncation:complexity3               0.14      0.15    -0.15
## visbottomgradbar:slope:complexity3               -0.09      0.31    -0.72
## visbrokenbar:slope:complexity3                    0.05      0.30    -0.54
## truncation:slope:complexity3                     -0.15      0.41    -0.90
## visbottomgradbar:truncation:slope:complexity3    -0.24      0.46    -1.14
## visbrokenbar:truncation:slope:complexity3        -0.23      0.45    -1.11
##                                               u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1]                                      0.45 1.00      650     1437
## Intercept[2]                                      0.82 1.00      561     1219
## Intercept[3]                                      1.17 1.00      526     1060
## Intercept[4]                                      1.54 1.00      514      983
## disc_Intercept                                    2.08 1.00      529      890
## visbottomgradbar                                  0.08 1.00      993     1513
## visbrokenbar                                      0.09 1.00     1195     1829
## truncation                                        0.80 1.00      794     1399
## slope                                             2.86 1.00      649     1346
## complexity3                                       0.04 1.01      952     1290
## visbottomgradbar:truncation                       0.12 1.00     1238     1839
## visbrokenbar:truncation                           0.15 1.00     1411     1631
## visbottomgradbar:slope                            0.79 1.00     1005     1611
## visbrokenbar:slope                                0.73 1.00     1231     1874
## truncation:slope                                  1.35 1.00     2136     1907
## visbottomgradbar:complexity3                      0.17 1.00     1094     1555
## visbrokenbar:complexity3                          0.11 1.00     1281     2004
## truncation:complexity3                            0.20 1.00     1300     1687
## slope:complexity3                                 0.64 1.01     1262     1675
## visbottomgradbar:truncation:slope                 0.57 1.00     2029     2131
## visbrokenbar:truncation:slope                     0.72 1.00     2001     1770
## visbottomgradbar:truncation:complexity3           0.46 1.00     1386     1867
## visbrokenbar:truncation:complexity3               0.44 1.00     1502     1919
## visbottomgradbar:slope:complexity3                0.51 1.00     1177     1808
## visbrokenbar:slope:complexity3                    0.66 1.00     1601     1772
## truncation:slope:complexity3                      0.67 1.00     1932     2285
## visbottomgradbar:truncation:slope:complexity3     0.66 1.00     2316     1887
## visbrokenbar:truncation:slope:complexity3         0.62 1.00     2459     2066
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).
plot(m3)

This looks much better! Why did changing the prior work?

Imagine the sampler as a marble dropped into a bowl, sweeping out a path along the bowl’s surface. In this common metaphor, the marble is the sampler, the surface of the bowl is the parameter space, and discrete locations along the marble’s path are our posterior draws. When we add many parameters to a model, as we did when we added multiple interactions, we create a high-dimensional surface that can become “lumpy” with local extrema, making it difficult for the chains to converge. By setting regularizing priors, we smooth the surface of the parameter space, making it easier for the chains to converge.

Another way of thinking about this is that adding parameters to a model entails an assumption that there are more sources of variation. As we add more sources of variation, we need to assume that each individual parameter is going to contribute less variance to the posterior distribution. Otherwise, we tell the sampler to explore an overdispersed parameter space, leading to the wandering chains we saw before.
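As a concrete illustration, regularizing priors in brms might look something like the sketch below. The formula is reconstructed from the coefficient names in the summary above, and the prior scales are hypothetical, not the exact values used to fit m3; substitute whatever weakly informative scales suit the latent (logit) scale.

```r
# Hypothetical regularizing priors: shrink fixed effects and
# random-effect standard deviations toward zero on the latent scale.
# The scales shown here are illustrative, not the values used for m3.
priors <- c(
  prior(normal(0, 1), class = "b"),   # fixed effects and interactions
  prior(normal(0, 1), class = "sd")   # random-effect standard deviations
)

m_regularized <- brm(
  response ~ vis * truncation * slope * complexity + (1 | subject),
  data = data,
  family = cumulative("logit"),
  prior = priors
)
```

Tightening the `sd` prior in particular discourages the sampler from wandering into regions where a single random effect absorbs implausibly large amounts of variance.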

Let’s compare the empirical distribution of responses to the distribution predicted by the model.

Here’s the empirical distribution.

data %>%
  mutate(rating = ordered(response, levels = c("5","4","3","2","1"))) %>%
  ggplot(aes(x = vis, fill = rating)) +
    geom_bar() +
    theme_minimal() +
    facet_grid(. ~ truncation)

Here’s the posterior predictive distribution as HOPs.

spec <- data %>%
  add_predicted_draws(m3, seed = 14, n = 50) %>%
  mutate(rating = ordered(.prediction, levels = c("5","4","3","2","1"))) %>%
  ggplot(aes(x = vis, fill = rating)) +
    geom_bar() +
    theme_minimal() +
    facet_grid(. ~ truncation) +
    transition_manual(.draw)

animate(spec, fps = 2.5)
## nframes and fps adjusted to match transition

This looks pretty similar to the simpler model we fit previously. Let’s compare the two models that fit well using leave-one-out (loo) cross-validation, which estimates out-of-sample predictive accuracy as expected log posterior density (elpd). As a rule of thumb, larger (less negative) values of elpd reflect a better fit.

loo(m1, m3)
## Warning: Found 1 observations with a pareto_k > 0.7 in model 'm1'. It is
## recommended to set 'reloo = TRUE' in order to calculate the ELPD without the
## assumption that these observations are negligible. This will refit the model 1
## times to compute the ELPDs for the problematic observations directly.
## Warning: Found 1 observations with a pareto_k > 0.7 in model 'm3'. It is
## recommended to set 'reloo = TRUE' in order to calculate the ELPD without the
## assumption that these observations are negligible. This will refit the model 1
## times to compute the ELPDs for the problematic observations directly.
## Output of model 'm1':
## 
## Computed from 3000 by 1152 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo  -1105.9 23.6
## p_loo        58.7  3.2
## looic      2211.9 47.1
## ------
## Monte Carlo SE of elpd_loo is NA.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     1144  99.3%   970       
##  (0.5, 0.7]   (ok)          7   0.6%   254       
##    (0.7, 1]   (bad)         1   0.1%   441       
##    (1, Inf)   (very bad)    0   0.0%   <NA>      
## See help('pareto-k-diagnostic') for details.
## 
## Output of model 'm3':
## 
## Computed from 3000 by 1152 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo  -1115.4 23.3
## p_loo        68.7  3.5
## looic      2230.7 46.6
## ------
## Monte Carlo SE of elpd_loo is NA.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     1144  99.3%   670       
##  (0.5, 0.7]   (ok)          7   0.6%   270       
##    (0.7, 1]   (bad)         1   0.1%   181       
##    (1, Inf)   (very bad)    0   0.0%   <NA>      
## See help('pareto-k-diagnostic') for details.
## 
## Model comparisons:
##    elpd_diff se_diff
## m1  0.0       0.0   
## m3 -9.4       2.5

It looks like the simpler model (m1) fits slightly better: in the comparison table, the best-fitting model is listed first with an elpd_diff of 0, and m3’s elpd is about 9.4 lower, so the extra parameters in m3 do not improve out-of-sample prediction. It is common practice to fit a handful of models, expanding from simpler to more complex specifications, and compare them all using posterior predictive visualizations and metrics like loo cross-validation.